Background and Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

Last year, the bank ran a campaign encouraging liability customers to take a personal loan, and a conversion rate of over 9% was achieved. This success has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

To build a model that will help the marketing department predict whether a liability customer will buy a personal loan or not.

  1. To predict whether a liability customer will buy a personal loan or not.
  2. To identify which variables are most significant in predicting whether a liability customer will take a personal loan.
  3. To identify which segment of customers should be targeted more.

Dataset

Loading Libraries

Loading Data

Understand the shape of the dataset.

Check the data types of the columns for the dataset.

Summary of the dataset.
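Since the notebook's code cells are not shown, here is a minimal sketch of the loading-and-inspection step, using a small hand-made DataFrame as a stand-in for the bank data (the CSV name in the comment is an assumption):

```python
import pandas as pd

# Hypothetical loading step -- the actual file name/path is an assumption:
# df = pd.read_csv("Loan_Modelling.csv")
df = pd.DataFrame(
    {"Age": [25, 45, 39], "Income": [49, 110, 72], "Personal_Loan": [0, 1, 0]}
)

print(df.shape)       # (rows, columns) -- the shape of the dataset
print(df.dtypes)      # data type of each column
print(df.describe())  # summary statistics for the numeric columns
```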

Data Pre-processing

Univariate Analysis

Observations on Age

Observations on Experience

Observations on Income

Observations on ZipCode

Observations on Family size

Observations on CCAvg

Observations on Education

Observations on Mortgage

Observations on Personal loan

Observations on Securities Account

Observations on Online

Observations on Credit card

Count Plot for Credit Card

Count Plot for Online Use

Count Plot for Securities account

Count Plot for Personal Loan in last campaign

Count Plot for CD account

Bivariate analysis

Correlation matrix graphical representation
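A sketch of how the correlation matrix behind that plot can be computed, using synthetic stand-ins for a few numeric columns; passing the result to a heatmap function (e.g. seaborn's `heatmap`) gives the graphical form:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins for numeric columns -- the real values would come
# from the loaded bank DataFrame.
df = pd.DataFrame({
    "Age": rng.integers(23, 67, 500),
    "Experience": rng.integers(0, 43, 500),
    "Income": rng.integers(8, 225, 500),
})

corr = df.corr()  # pairwise Pearson correlation between numeric columns
print(corr.round(2))
# seaborn.heatmap(corr, annot=True) would render the graphical version.
```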

PairPlots for attributes

Influence of income and education on personal loan

Education vs Personal Loan

Age vs Personal Loan

Income vs Personal Loan

Experience vs Personal Loan

CCAvg vs Personal Loan

Securities Account vs Personal Loan

CD Account vs Personal Loan

Online vs Personal Loan

Family vs Personal Loan

CreditCard vs Personal Loan

The ID column has no relation to personal loan take-up, so we drop it.

Modelling using Decision Tree

Split Data
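The split can be sketched as follows, with synthetic arrays standing in for the prepared features and target; stratifying on the loan label keeps the class balance the same in both halves:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))        # stand-in feature matrix
y = rng.integers(0, 2, size=1000)     # stand-in 0/1 personal-loan label

# Stratify on y so train and test keep the original class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```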

Build Decision Tree Model

Scoring our Decision Tree

What does a bank want?

Which scenario has a greater loss?

A false negative means missing a customer who would have taken a personal loan, which costs the bank interest income. Since we do not want to miss such customers, we should use recall as the model-evaluation metric instead of accuracy.

Confusion Matrix

Visualizing the Decision Tree

The tree above is very complex; such a tree often overfits.

Reducing overfitting

Confusion Matrix - decision tree with depth restricted to 3
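A sketch of the depth-restricted tree on synthetic stand-in data (the dataset and split parameters are assumptions), showing the confusion matrix and the recall it is judged by:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data.
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Restricting depth is the simplest pre-pruning lever: the tree can no
# longer grow until every leaf is pure, which curbs overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)
print(confusion_matrix(y_test, tree.predict(X_test)))
print("test recall:", recall_score(y_test, tree.predict(X_test)))
```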

Visualizing the Decision Tree

Using GridSearch for Hyperparameter tuning of our tree model
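A sketch of such a grid search; the grid values shown are illustrative rather than the notebook's actual grid, and the search is scored on recall in line with the metric chosen above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

# Illustrative grid -- the notebook's real grid is an assumption.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

# Optimize for recall rather than accuracy, per the discussion above.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```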

Confusion Matrix - decision tree with tuned hyperparameters

Visualizing the Decision Tree

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
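The pruning path described above can be obtained as follows (synthetic data stands in for the bank dataset):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Alphas come back in increasing order; total leaf impurity rises with alpha.
print(ccp_alphas[:3])
print(impurities[:3])
```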

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
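A sketch of that step: fit one tree per effective alpha, discard the trivial single-node tree at the end, and inspect how node count shrinks as alpha grows (again on synthetic stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# One tree per effective alpha; the last alpha prunes the whole tree away.
clfs = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
    for a in path.ccp_alphas
]
print("nodes in last tree:", clfs[-1].tree_.node_count)

# Drop the trivial tree before comparing sizes.
clfs, ccp_alphas = clfs[:-1], path.ccp_alphas[:-1]
node_counts = [c.tree_.node_count for c in clfs]
depths = [c.tree_.max_depth for c in clfs]
print(node_counts[0], "->", node_counts[-1])  # shrinks as alpha grows
```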

Accuracy vs alpha for training and testing sets

When ccp_alpha is set to zero and the other parameters of DecisionTreeClassifier are kept at their defaults, the tree overfits. As alpha increases, more of the tree is pruned, thus creating a decision tree that generalizes better.

Since accuracy isn't the right metric for our data, we want high recall instead.

Confusion Matrix - post-pruned decision tree

Visualizing the Decision Tree

The decision tree with post-pruning gives the highest recall on the test set.

Data Pre-Processing

Treating Outliers
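One common treatment, and a plausible reading of this step, is IQR capping: clip each numeric column to [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A sketch on a synthetic skewed column (the column itself is a stand-in for, e.g., Income):

```python
import numpy as np
import pandas as pd

# Synthetic column: mostly moderate values plus two extreme outliers.
rng = np.random.default_rng(0)
s = pd.Series(np.concatenate([rng.normal(60, 15, 500), [400.0, 550.0]]))

# IQR capping: clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
treated = s.clip(lower, upper)
print(s.max(), "->", treated.max())  # extreme values pulled back to the cap
```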

Modelling using Logistic regression

Data Split

Building the model
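A minimal sketch of fitting the logistic regression on synthetic stand-in data, evaluated on recall as discussed above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-processed bank data.
X, y = make_classification(n_samples=1000, n_features=6, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test recall:", recall_score(y_test, logreg.predict(X_test)))
```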

What does a bank want?

Which scenario has a greater loss?

A false negative means missing a customer who would have taken a personal loan, which costs the bank interest income. Since we do not want to miss such customers, we should use recall as the model-evaluation metric instead of accuracy.

How do we reduce this loss, i.e., reduce false negatives?

First, let's create two functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.

ROC-AUC

Finding the coefficients

Coefficient interpretations

Converting coefficients to odds
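The conversion is just exponentiation: e^β is the multiplicative change in the odds of taking a loan per unit increase in the feature. A sketch with hypothetical coefficient values (the actual values come from the fitted model):

```python
import numpy as np

# Hypothetical fitted coefficients -- stand-ins, not the model's real values.
coefs = np.array([0.05, 2.1, -0.3])

odds = np.exp(coefs)             # e^beta: odds ratio per unit increase
pct_change = (odds - 1) * 100    # equivalent percentage change in odds
for b, o, p in zip(coefs, odds, pct_change):
    print(f"beta={b:+.2f} -> odds ratio={o:.2f} ({p:+.1f}% change in odds)")
```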

Coefficient interpretations

Model Performance Improvement

The F1 score has increased on both the training and testing data.

Let's use Precision-Recall curve and see if we can find a better threshold
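A sketch of threshold selection from the precision-recall curve on synthetic data; the crossover rule used here (smallest precision-recall gap) is one common choice and may differ from the notebook's:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic stand-in data and model.
X, y = make_classification(n_samples=800, n_features=6, random_state=5)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Precision and recall at every candidate probability threshold.
probs = model.predict_proba(X)[:, 1]
precision, recall, thresholds = precision_recall_curve(y, probs)

# Pick the threshold where precision and recall are closest (one simple rule).
gap = np.abs(precision[:-1] - recall[:-1])
best_threshold = thresholds[np.argmin(gap)]
print("chosen threshold:", round(float(best_threshold), 3))
```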

Sequential Feature Selector

Selecting subset of important features using Sequential Feature Selector method
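A sketch with scikit-learn's SequentialFeatureSelector; the number of features to keep and the recall scorer are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with a few genuinely informative features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=7)

# Greedy forward selection of 3 features, scored by recall via 5-fold CV.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    scoring="recall",
    cv=5,
)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```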

Why should we do feature selection?

Let's Look at model performance

Conclusion

Comparison between Decision Tree and Logistic Regression

The recall of the decision tree (0.95 on training and 0.89 on testing) is considerably higher than that of the best-performing logistic regression model. The decision tree algorithm therefore performs better than logistic regression here.

Recommendations